16 research outputs found

Online estimation of discrete densities using classifier chains

Author: Frank, Eibe; Geilke, Michael; Kramer, Stefan
Publication venue: ADReM
Publication date: 01/01/2012

We propose an approach to estimate a discrete joint density online, that is, the algorithm is given only the current example, its current estimate, and a limited amount of memory. To design an online estimator for discrete densities, we use classifier chains to model dependencies among features. Each classifier in the chain estimates the probability of one particular feature. Because a single chain may not provide a reliable estimate, we also consider ensembles of classifier chains. Our experiments on synthetic data show that the approach is feasible and that the estimated densities approach the true, known distribution with increasing amounts of data.
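
The chain idea in this abstract can be illustrated with a minimal sketch. It assumes features are encoded as small integers and uses an incremental naive Bayes as the base classifier; that choice, and the names `OnlineNaiveBayes` and `ClassifierChainDensity`, are hypothetical and not taken from the paper. The joint density is the product of the per-feature conditional estimates.

```python
from collections import defaultdict
from itertools import product


class OnlineNaiveBayes:
    """Incremental naive Bayes over discrete inputs (assumed base classifier)."""

    def __init__(self, n_target_values, input_cardinalities, alpha=1.0):
        self.k = n_target_values
        self.cards = input_cardinalities
        self.alpha = alpha  # Laplace smoothing constant
        self.class_counts = [0] * n_target_values
        # feat_counts[j][(target_value, input_value)] -> count
        self.feat_counts = [defaultdict(int) for _ in input_cardinalities]

    def update(self, inputs, target):
        """Fold one example into the counts; no past examples are stored."""
        self.class_counts[target] += 1
        for j, v in enumerate(inputs):
            self.feat_counts[j][(target, v)] += 1

    def predict_proba(self, inputs):
        """Return a normalized distribution over the target's values."""
        n = sum(self.class_counts)
        scores = []
        for t in range(self.k):
            p = (self.class_counts[t] + self.alpha) / (n + self.alpha * self.k)
            for j, v in enumerate(inputs):
                num = self.feat_counts[j][(t, v)] + self.alpha
                den = self.class_counts[t] + self.alpha * self.cards[j]
                p *= num / den
            scores.append(p)
        total = sum(scores)
        return [s / total for s in scores]


class ClassifierChainDensity:
    """Joint density as a chain: p(x) = prod_i p(x_i | x_1, ..., x_{i-1})."""

    def __init__(self, cardinalities):
        self.cards = cardinalities
        # Classifier i predicts feature i from the features before it.
        self.chain = [
            OnlineNaiveBayes(c, cardinalities[:i])
            for i, c in enumerate(cardinalities)
        ]

    def update(self, x):
        for i, clf in enumerate(self.chain):
            clf.update(x[:i], x[i])

    def density(self, x):
        p = 1.0
        for i, clf in enumerate(self.chain):
            p *= clf.predict_proba(x[:i])[x[i]]
        return p


# Usage: three discrete features with 2, 2 and 3 possible values.
est = ClassifierChainDensity([2, 2, 3])
for example in [(0, 0, 1), (0, 0, 1), (1, 1, 2), (0, 0, 1)]:
    est.update(example)
total = sum(est.density(x) for x in product(range(2), range(2), range(3)))
```

Because every conditional is normalized, the estimates over all value combinations sum to one. An ensemble, as mentioned in the abstract, could average several such chains built with different feature orders.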

Research Commons@Waikato

Online density estimates : a probabilistic condensed representation of data for knowledge discovery

Author: Geilke, Michael

The Internet of Things (IoT) and the data that is generated from its sensors are making new demands on data mining methods. These demands stem from the desire to benefit from the knowledge contained in this data and the increasing number of devices that are equipped with these sensors. According to companies like Intel or HP, the number of sensors worldwide is likely to reach more than one trillion by 2022. All of them will produce streams of measurements, and leveraging knowledge from these streams requires infrastructure to analyze them in real time. From a data mining perspective, this involves challenging tasks such as cleaning the data, handling large amounts of data, and preserving their privacy, to name a few. The state of the art in data mining has already addressed some of these challenges, but the proposed methods are typically designed for a specific task (e.g., predicting a certain variable or finding frequent patterns) and perform this task while scanning the data stream. However, at the time of collecting the data, it is often not known what kind of analysis needs to be performed, or there are several (possibly even dependent) analysis tasks. This means that whenever storing the original data is either not feasible due to the sheer volume or impossible due to privacy concerns, the user has to wait for more data to initiate another analysis task, which impedes the use of conventional data mining algorithms. Therefore, we present a framework in this thesis, called MiDEO (Mining Density Estimates inferred Online), which decouples the process of collecting the data from the actual analysis. It uses density estimates to maintain a compact representation of the data stream and provides inference capabilities to perform queries on them. The queries can be combined into complex data mining tasks and allow the estimates to be adapted to the current needs of the user or the algorithm.
Compared to current methods that typically focus on one task at a time, this enables a more interactive analysis of the data stream, where the task selection is part of the analysis. In the course of designing such a framework, we develop several methods to improve the state of the art. This includes online density estimators for conditional joint densities with mixed types of variables, an online density estimator for high-dimensional data, algorithms to perform pattern mining on online density estimates, an online density estimator that is able to represent recurrences in the data stream, and algorithms that enforce well-known privacy-preserving properties to protect the entities described by the data. To show the effectiveness of these methods, we prove some of their theoretical properties and perform an extensive set of experiments.

Privacy-preserving pattern mining on online density estimates

Author: Geilke, Michael; Kramer, Stefan

Traditional pattern mining algorithms require access to the data, either in the form of a complete set of data, as in batch data mining, or in the form of a window of recent data, as in stream mining. In the case of stream mining, this comes with a number of disadvantages, such as the possibly unbounded growth of relevant instances, drift, possibly changing data mining tasks, and issues with privacy, to name a few. Therefore, an approach has recently been proposed that extracts patterns just from statistical information about the stream, more precisely, from an online density estimate that is inferred from it. As this approach is mainly based on sampling from the density estimates, it still struggles with itemsets having a medium to low frequency. To resolve this issue, we pursue an alternative strategy in this paper and directly exploit the structure of the density estimates to extract frequent itemsets. Additionally, we address the important matter of privacy-preserving data mining by ensuring that the density estimate fulfills privacy-related properties. To show the effectiveness of the proposed methods, we provide proofs and evaluate the performance on synthetic and real-world data.

Gutenberg Open

A probabilistic condensed representation of data for stream mining

Author: Geilke, Michael; Karwath, Andreas; Kramer, Stefan

Data mining and machine learning algorithms usually operate directly on the data. However, if the data is not available at once or consists of billions of instances, these algorithms easily become infeasible with respect to memory and run-time concerns. As a solution to this problem, we propose a framework, called MiDEO (Mining Density Estimates inferred Online), in which algorithms are designed to operate on a condensed representation of the data. In particular, we propose to use density estimates, which are able to represent billions of instances in a compact form and can be updated when new instances arrive. As an example of an algorithm that operates on density estimates, we consider the task of mining association rules, which we regard as a form of simple statements about the data. The algorithm, called POEt (Pattern mining on Online density esTimates), is evaluated on synthetic and real-world data and is compared to state-of-the-art algorithms.
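
To make the idea of rule mining on a density estimate concrete, here is a small sketch that is not the POEt algorithm itself. It assumes the estimate is available as an explicit table mapping binary item vectors to probabilities (a hypothetical stand-in for a real online estimator) and reads the support and confidence of a rule off that table by marginalization, without touching raw instances.

```python
def support(density, itemset):
    """Marginal probability that every item in `itemset` is present."""
    return sum(
        p for row, p in density.items()
        if all(row[i] == 1 for i in itemset)
    )


def confidence(density, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent under the estimate."""
    s_a = support(density, antecedent)
    if s_a == 0:
        return 0.0
    return support(density, set(antecedent) | set(consequent)) / s_a


# Assumed toy estimate over three items; the probabilities sum to 1.
density = {
    (1, 1, 0): 0.4,
    (1, 0, 0): 0.1,
    (0, 1, 1): 0.3,
    (0, 0, 0): 0.2,
}

supp_item0 = support(density, {0})          # items are column indices
conf_0_to_1 = confidence(density, {0}, {1})  # rule: item 0 -> item 1
```

With a real estimator the table would be too large to enumerate, so the same quantities would have to come from the estimator's own inference or sampling routines.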

Gutenberg Open

Modeling recurrent distributions in streams using possible worlds

Author: Geilke, Michael; Karwath, Andreas; Kramer, Stefan
Publication date: 01/01/2015

Discovering changes in the data distribution of streams and discovering recurrent data distributions are challenging problems in data mining and machine learning. Both have received a lot of attention in the context of classification. With the ever-increasing growth of data, however, there is a high demand for compact and universal representations of data streams that enable the user to analyze current as well as historic data without having access to the raw data. To take a first step in this direction, we propose a condensed representation that captures the various (possibly recurrent) data distributions of the stream by extending the notion of possible worlds. The representation enables queries concerning the whole stream and can, hence, serve as a tool for supporting decision-making processes or as a basis for implementing data mining and machine learning algorithms on top of it. We evaluate this condensed representation on synthetic and real-world data.

CiteSeerX

University of Birmingham Research Portal

Gutenberg Open

Online density estimation of heterogeneous data streams in higher dimensions

Author: Geilke, Michael; Karwath, Andreas; Kramer, Stefan
Publication venue: Springer
Publication date: 04/09/2016

The joint density of a data stream is suitable for performing data mining tasks without having access to the original data. However, the methods proposed so far only target a small to medium number of variables, since their estimates rely on representing all the interdependencies between the variables of the data. High-dimensional data streams, which are becoming more and more frequent due to increasing numbers of interconnected devices, are, therefore, pushing these methods to their limits. To mitigate these limitations, we present an approach that projects the original data stream into a vector space and uses a set of representatives to provide an estimate. Due to the structure of the estimates, it enables the density estimation of higher-dimensional data and approaches the true density with increasing dimensionality of the vector space. Moreover, it is not only designed to estimate homogeneous data, i.e., where all variables are nominal or all variables are numeric, but it can also estimate heterogeneous data. The evaluation is conducted on synthetic and real-world data. The software related to this paper is available at https://github.com/geilke/mideo

University of Birmingham Research Portal

Gutenberg Open

A probabilistic condensed representation of data for stream mining

Author: Geilke, Michael; Karwath, Andreas; Kramer, Stefan
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2014

University of Birmingham Research Portal

Gutenberg Open